home *** CD-ROM | disk | FTP | other *** search
- From: chughes@maths.tcd.ie (Conrad Hughes)
- Subject: Re: fast divide
- Date: 31 Oct 91 13:15:24 GMT
-
- DEFFNdiv(A,B,C,R):LOCALA%:[OPTpass:MOVS C,A,LSR#31:RSBNE A,A,#0
- :ADDS C,C,A,LSL#16:MOV A,A,LSR#16:ADDS A,B,A:SUBCC A,A,B:]
- FORA%=1TO31:[OPTpass:ADCS C,C,C:ADCS A,B,A,LSL#1:SUBCC A,A,B:]NEXT
- [OPTpass:ADCS R,C,C:RSBCC R,R,#0:]=0
-
- This is the BASIC assembler code to do the division.. Use it this way:
-
- FNdiv(reg_top,reg_bot,reg_misc,reg_res)
-
- while in the BASIC assembler.. It inlines the code. Register reg_res
- contains 32768*(reg_top/reg_bot); reg_bot _must_ be negative (if you
- don't know what sign it'll be, tack the following in at the beginning:
-
- CMP B,#0:RSBPL B,B,#0:RSBPL A,A,#0
-
- ...) , and reg_top can be either sign.. If you need slightly higher
- accuracy (I think I can increase the accuracy by half a bit) and
- don't worry about registers, tell me. reg_res can be any register,
- including reg_top, reg_bot or reg_misc.
-
- If you want to turn it into a loop, do so.. But remember that the
- fundamental unit of of the routine takes 3 clock counts, a branch-loop
- takes five! Also, it'll be straightforward enough to turn into a
- subroutine..
-
- If you need any clarification of it, ask me...
-
- Someone posted another version which is faster for a logarithmic
- distribution of numbers, with 0 and 1 occuring most frequently; I
- haven't got it to hand, but could dig it up if nobody else posts
- it.
-
- Conrad
-
-
-
- From: torbenm@diku.dk (Torben AEgidius Mogensen)
- Date: 31 Oct 91 15:34:21 GMT
- Subject: Re: fast divide
- Sender: torbenm@freke.diku.dk
-
- Here is one I cooked up. It is from memory, so there might be slight
- errors. The comparisons before the first SUB ensure that no overflow
- happens on the ASL. Worst case time is 5 cycles per bit.
-
- Arguments in p,q.
- on exit, r = p div q, p = p mod q
-
- .div
- MOV r,#0
- CMP p,q
- BLT nodiv
- CMP p,q ASL #1
- BLT div0
- CMP p,q ASL #2
- BLT div1
- ... \ repeat for 3..30
- CMP p,q ASL #31
- SUBGE p,q ASL #31
- ADDGE r,#2^31
- CMP p,q ASL #30
- .div30
- SUBGE p,q ASL #30
- ADDGE r,#2^30
- CMP p,q ASL #29
- .div29
- ... \ repeat for 29..1
- .div0
- SUBGE p,q ASL #0 \ equiv. to SUBGE p,q
- ADDGE r,#2^0
- .nodiv
- MOV PC,Link
-
-
-
-
-
- From: gcwillia@daisy.waterloo.edu (Graeme Williams)
- Date: 2 Nov 91 20:13:28 GMT
- Subject: Re: fast divide
- Organization: University of Waterloo
-
- Ok, here is the fastest unsigned divide routine I know of. The 3 instruction
- loop (unrolled) wasn't my idea - but the piece of code that figures out
- whereabouts to jump into the unrolled loop is.
-
- On entry d and n should be both +ve.
-
- CMP d,#0 : BEQ end% ; Remove zero divisors
- CMP d,n : MOVGT m,n : MOVGT n,#0 : BGT end% ; Exit if denominator > numerator
-
- ; This next bit of code figures out roughly how many bits the quotient is
- ; going to have and hence how many times you need to run through the
- ; divide loop proper - its only worth finding out to the nearest 4 times
-
- .start%
- MOV cnt,#28 : MOV m,n,LSR#4
- CMP d,m,LSR#12 : SUBLE cnt,cnt,#16 : MOVLE m,m,LSR#16
- CMP d,m,LSR#4 : SUBLE cnt,cnt,#8 : MOVLE m,m,LSR#8
- CMP d,m : SUBLE cnt,cnt,#4 : MOVLE m,m,LSR#4
-
- MOV n,n,LSL cnt
- ADDS n,n,n : RSB d,d,#0 ; Have to flip the sign of d for the divide loop
- ADD cnt,cnt,cnt,LSL#1 ; Multiply cnt by 3
- ADD pc,pc,cnt,LSL#2 ; Jump cnt instructions
- MOV R0,R0 ; Dummy instruction to take care of pipelining
-
- 32 copies of { ADCS m,d,m,LSL#1 : SUBCC m,m,d : ADCS n,n,n }
-
- .end%
-
- The best times occur when the numerator and denominator of your division
- are of similar size. The worst times when the numerator is very large
- and the denominator small. (The case where the denominator is larger than
- the numerator is killed off very early.)
-
- Suppose one has a distribution of numbers wherein the highest bit of
- the numbers is equally distributed between bit 0 and bit 31, that is,
- there is the same number of numbers between 2^n and 2^(n+1) for every n.
- (I think this is actually the case for most purposes)
-
- Now if a numerator and denominator are selected at random from this
- distribution subject only to the requirement that denom <= numer
- then the above routine will execute on average in 58.25 cycles
- (from the point start% in the code).
-
- The worst case is 116 cycles (top bit of quotient is bit 28 or higher)
- the best case is 32 cycles (top bit of quotient is bit 0 to 3)
- The inbetween cases get 12 cycles worse for every 4 extra bits in the
- quotient.
-
- Note that the average case isn't the simple average of the various cases
- but has to be weighted toward the better cases - this is because there
- are more ways (27) for numer and denom to differ by say 5 bits than there
- are (7) for them to differ by say 25 bits.
-
- Have fun.
-
- Graeme Williams
-
-
-
- From: zrzm0370@rusmv1.rus.uni-stuttgart.de (Joerg Scheurich)
- Subject: integer division
- Date: 10 Jun 91 11:25:36 GMT
-
- Hi
-
- So i know, there is no Assembler-command for integer-division in the
- ARM. Therefor a fast division-routine is a thing of common interest.
-
- In a earlier Posting i wrote about the bad quick hacked division-routine
- for the demo of my formula-tool (math. formula ==> assemblermacros).
-
- Now i wrote a better routine, but i don't know, if there is a better
- algorithm.
-
- The shorter and slower 32-Bit/32-Bit-division-routine requires
- 382 Cycles in 100 Bytes, with usage of 4 32-Bit-registers and 4 Byte memory.
- With unrolling the loop,
- the faster and longer 32-Bit/32-Bit-division-routine requires
- 182 Cycles in 692 Bytes, with usage of 4 32-Bit-registers.
-
- This is very faster then the old routine (which requires 2**33 Cycles in
- worst case and requires all time of the universe, if there is a division
- by zero ... )
-
- Cause the INTERNAL-assembler-command of the INTEL 8086
- for a 32-Bit/16-Bit-division requires
- 165-184 Cycles in 2-4 Bytes, with usage of 3 16-Bit-registers,
- i think, the result is ok.
-
- The result is nothing against the INTERNAL-assembler-command of the ARM
- for a 32-Bit*32-Bit-multiplication which requires
- 17 Cycles in 4 Bytes, with usage of 1-3 32-Bit-registers,
- but my Professor in digital electronics used to say:
- "If a program contain much divisions, the programmer is a idiot" ...
-
- Know someone a better (faster or shorter) algorithm ?
-
- so long
- MUFTI
-
- ( internet: zrzm0111@helpdesk.rus.uni-stuttgart.de
- ( from janet: zrzm0111%de.uni-stuttgart.rus.helpdesk@NFS-RELAY )
- bitnet: ZRZM AT DS0RUS1I )
-
- div-fast:
- ---------
- ; division routine
- ; 1. calculate flag if result is negate and convert operands to positiv
- ; 2. second "divide" with unsigned suczessive Approximation
- ; 3. fix the sign of result
-
- .MACRO INTdivide
- LDR R2,%2
- LDR R3,%3
-
- MOV R0,#0
- CMP R2,#0
- RSBLT R2,R2,#0
- SUBLT R0,R0,#1
- CMP R3,#0
- RSBLT R3,R3,#0
- MVNLT R0,R0
-
- MOV R1,#0
-
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l2
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l3
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l4
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l5
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l6
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l7
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l8
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l9
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l10
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l11
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l12
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l13
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l14
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l15
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l16
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l17
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l18
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l19
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l20
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l21
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l22
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l23
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l24
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l25
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l26
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l27
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l28
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l29
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l30
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l31
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- \l32
- ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
-
- CMP R0,#0
- RSBNE R2,R2,#0
- STR R2,%1
- .ENDM
-
- div-short:
- ----------
- ; division routine
- ; 1. calculate flag if result is negate and convert operands to positiv
- ; 2. second "divide" with unsigned suczessive Approximation
- ; 3. fix the sign of result
-
- .MACRO INTdivide
- LDR R2,%2
- LDR R3,%3
-
- MOV R0,#0
- CMP R2,#0
- RSBLT R2,R2,#0
- SUBLT R0,R0,#1
- CMP R3,#0
- RSBLT R3,R3,#0
- MVNLT R0,R0
- STR R0,minusflag
-
- MOV R0,#32.
- MOV R1,#0
- \loop ADDS R2,R2,R2
- ADCS R1,R1,R1
- CMP R1,R3
- SUBGE R1,R1,R3
- ADDGE R2,R2,#1
- SUB R0,R0,#1
- CMP R0,#0
- BNE \loop
-
- LDR R0,minusflag
- CMP R0,#0
- RSBNE R2,R2,#0
- STR R2,%1
- .ENDM
-
-
-
- From: chughes@maths.tcd.ie (Conrad Hughes)
- Subject: Re: integer division
- Date: 11 Jun 91 14:10:31 GMT
-
- zrzm0370@rusmv1.rus.uni-stuttgart.de (Joerg Scheurich) writes:
- >382 Cycles in 100 Bytes, with usage of 4 32-Bit-registers
- >and 4 Byte memory.
- >With unrolling the loop,
- >the faster and longer 32-Bit/32-Bit-division-routine requires
- >182 Cycles in 692 Bytes, with usage of 4 32-Bit-registers.
-
- This divides a 64bit number by a 32bit one, with 32bit result, in 97
- instructions/388 bytes/97 clock cycles..
-
- Entry: A=MSW of 64bit no., B=LSW of 64bit no., C=32bit no.
- C _must_ be negative (if it's positive, RSB C,C,#0),
- A _must_ be positive (if not, RSBS B,B,#0:RSC A,A,#0)
- ADDS B,B,B
- ADCS A,C,A,LSL#1:SUBCC A,A,C:ADCS B,B,B [repeat these three 32 times]
- Exit: A=Remainder of division, B=result, C unchanged.
- ..if you don't need the remainder, get rid of the last
- SUBCC A,A,C.
- The result will be unsigned, and flags will (obviously)
- reflect the last addition.
- Adjust as you will to get 32/32 division (Initialise A to 0, or AB
- = numerator shifted some way) or any other variety you like..
- Rolling up the loop it could get slightly complicated since every
- instruction requires flag information from the previous one ;-)
- Something like SUB count,count,#1:TEQ count,#0:BNE loop with count
- initialised to #32 might do the trick, but it'll slow it down to at
- least 288 cycles (so what if it only uses 32 bytes?? ;-}), which is
- pretty unacceptable unless you're on an ARM3..
-
- This is off the top of my head and I'm nowhere near the Arc, so if
- it doesn't work (it should - I've typed it in enough times now..)
- flame me..
-
- Conrad
-
-
-
- From: gcwillia@daisy.waterloo.edu (Graeme Williams)
- Subject: Re: really FAST integer division
- Summary: Avg. exec. time < 60 cycles - the fastest div in the West?
- Date: 28 Jun 91 19:07:58 GMT
- Organization: University of Waterloo
-
- Following several posts regarding division routines, in particular one
- with a lovely 3 instr. central loop I modified my old division routine
- into the following (supersonic, lightning fast, ZR rated, :-) )
- 32bit routine:
-
- ;Essentially what's happening here is we are figuring out how far left
- ;we can shift the numerator before any actual dividing begins and thus save
- ;a significant number of useless trips through the (unrolled) 3 instr. loop.
- ; - Note we only go to the nearest 4 bits, trying to save another 2 bits
- ; results in 3 extra instrs. for a saving of 6 instrs. half the time
- ; and 0 instrs. the other half - i.e. no nett benefit.
-
- MOV cnt,#28 : MOV mod,num,LSR#4
- CMP den,mod,LSR#12 : SUBLE cnt,cnt,#16 : MOVLE mod,mod,LSR#16
- CMP den,mod,LSR#4 : SUBLE cnt,cnt,#8 : MOVLE mod,mod,LSR#8
- CMP den,mod : SUBLE cnt,cnt,#4 : MOVLE mod,mod,LSR#4
- MOV num,num,LSL cnt : RSB den,den,#0
-
- ADDS num,num,num
- ;Now skip over cnt copies of the 3 instr. loop.
- ADD cnt,cnt,cnt,LSL#1 : ADD pc,pc,cnt,LSL#2 : MOV R0,R0
-
- {ADCS mod,den,mod,LSL#1 : SUBLO mod,mod,den : ADCS num,num,num} x 32 copies
-
-
- The routine behaves as follows:
- On Entry: num - is a signed numerator A (but +ve)
- den - is a signed denominator B (but non-zero and +ve)
- On Exit: mod - contains A MOD B
- num - contains A DIV B
-
- The routine requires 4 registers and 452 bytes of memory.
-
- Execution time:
- Given numbers A and B (both +ve) chosen at random from a distribution in
- which the most significant bit of the numbers is uniformly distributed
- between bits 0 and 31 (i.e. the density of numbers goes as roughly log n)
- then the average execution time is:
-
- For A < B : 32 cycles (no division actually required)
- For A >= B: 58.25 cycles
-
- The best and worst cases are 32 cycles and 116 cycles respectively.
-
-
- Have fun. (Stuff like this ought to be worth good money!)
-
- Graeme Williams
- gcwillia@daisy.waterloo.edu
-
- P.S. With any luck there's only a couple of typos :-).
-
-
-
- From: dseal@armltd.uucp (David Seal)
- Subject: Re: Dividing
- Date: 5 Mar 91 19:47:12 GMT
-
- In article <8828@castle.ed.ac.uk> ecwu61@castle.ed.ac.uk (R Renwick) writes:
-
- > Can anyone help me with this problem:
- >In my super-duper, wire-frame model animating 'C' program, I represent
- >floating point numbers as 4-byte integers. This I do by shifting the number
- >being represented by 15 bits (multiply by 32768) in order that I can keep a
- >track of the fractional part. The reason for doing this is that I want
- >to make sure my program runs as fast as possible by not using floating
- >point arithmetic...
- >
- > What I want to do is do a divide a number by another number
- >and not lose alot of the precision.
- > The way I do this at the moment is by
- > result=((a/b)<<15)
- >But if a and b are close together then (a/b) loses alot of precison
- >(since the result is an int).
- >
- >Does anyone know how I could do an integer division without losing
- >precision. I cannot use result=(((a<<8)/b)<<7) or any other tricks
- >because a may be close to 32768<<15 already...
- >Perhaps if I can find the reciprocal of b in the above representation
- >and multiply it by a then I'll not lose the as much precision, but I
- >don't know how to do this (find the reciprocal) :-(
-
- Two quick suggestions:
-
- (1) The basic problem is that integer division stops when it determines the
- bit in the units place, whereas you want it to go on and calculate
- another 15 fractional binary places. So rather than using C's integer
- divide operator, write a long division routine using subtractions and
- shifts and force it to go on 15 iterations longer than normal. (This
- won't be much slower than C's integer divide, which essentially uses a
- long division routine anyway).
-
- If speed is very important, you'll probably want to either (a) write a
- well-optimised assembler routine; or (b) write it as a C macro, so that
- the compiler includes it inline. In the latter case, you may want to
- look carefully at the assembler produced by the compiler for an instance
- of your macro and experiment a bit - careful choice of the details of
- the long division algorithm can make a significant difference to the
- speed of the division.
-
- Incidentally, you may want to watch out for significant bits being lost
- during the last 15 iterations: this is perfectly possible and means that
- your result has overflowed.
-
- (2) Use a Newton-Raphson iteration to find the reciprocal of your divisor
- and then multiply - this presupposes that you've got an efficient way to
- do multiplications, both for the Newton-Raphson iteration and for the
- final multiplication. The iteration to find the reciprocal of a number A
- is (in pseudo-code - you'll have to translate to C):
-
- newx = initial approximation to reciprocal;
- repeat
- x = newx;
- newx = x * (2 - A*x);
- until x and newx sufficiently close together;
-
- The main thing to beware of in this is that your initial approximation
- must be reasonably close to the reciprocal - it must be greater than 0
- and less than twice the reciprocal. Such an approximation can be found
- by finding the most significant 1 in A and choosing a value based on
- this - e.g. if A lies in the range 1.0 <= A < 2.0, its most significant
- 1 is at bit position 16, and a suitable starting point for the iteration
- would be 0.75. This gives us:
-
- Most sign. bit position Range for A Suitable starting point
- -----------------------------------------------------------------------
- :
- :
- 9 2^(-7) <= A < 2^(-6) 1.5*2^6
- 10 2^(-6) <= A < 2^(-5) 1.5*2^5
- 11 2^(-5) <= A < 2^(-4) 1.5*2^4
- 12 2^(-4) <= A < 2^(-3) 1.5*2^3
- 13 2^(-3) <= A < 2^(-2) 1.5*2^2
- 14 2^(-2) <= A < 2^(-1) 1.5*2^1
- 15 2^(-1) <= A < 2^0 1.5*2^0
- 16 2^0 <= A < 2^1 1.5*2^(-1)
- 17 2^1 <= A < 2^2 1.5*2^(-2)
- 18 2^2 <= A < 2^3 1.5*2^(-3)
- 19 2^3 <= A < 2^4 1.5*2^(-4)
- 20 2^4 <= A < 2^5 1.5*2^(-5)
- 21 2^5 <= A < 2^6 1.5*2^(-6)
- 22 2^6 <= A < 2^7 1.5*2^(-7)
- :
- :
-
- Watch out for very big values - e.g. 2^16 is representable in this
- number system, but 2^(-16) isn't!
-
- Hope these suggestions help.
-
- David Seal
- dseal@acorn.co.uk or dseal@armltd.uucp
-
-